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This  paper  presents  new  loop  transformation  techniques  that  can  extract  more  parallelism 
from  a  class  of  programs  than  existing  techniques.  A  formal  mathematical  framework 
which  unifies  the  existing  loop  transformation  techniques  is  given.  We  classify  schedules 
of  a  loop  transformation  into  three  classes:  uniform,  subdomain-variant,  and  statement- 
variant.  New  algorithms  for  generating  these  schedules  are  given.  Viewing  from  the 
degree  of  parallelism  to  be  gained  by  loop  transformation,  the  schedules  can  also  be  clas¬ 
sified  as  single-sequential  level,  multiple-sequential  level,  and  mixed  schedule.  We  describe 
iterative  and  recursive  algorithms  to  obtain  multiple-sequential  level  and  mixed  schedules 
respectively  based  on  the  algorithms  for  single-sequential  level  schedules. 

This  work  has  been  supported  in  part  by  the  Office  of  Naval  Research  under  Contract 
N00014-89-J-1906,  N00014-90-J-1987. 


1  Introduction 


One  of  the  central  issues  in  restructuring  compiler  is  to  discover  parallelism  automatically 
and  generate  correct  parallel  control  structures  that  can  take  advantage  of  the  large  number 
of  processors.  The  advent  of  massively  parallel  machines  opens  up  opportunities  for  pro¬ 
grams  that  have  large-scale  parallelism  to  gain  tremendous  performance  over  those  that  do 
not.  This  paper  presents  new  loop  transformation  techniques  that  can  extract  more  paral¬ 
lelism  from  a  class  of  programs  than  existing  techniques.  The  particular  class  of  programs 
are  those  that  consist  of  perfectly  nested  loops  possibly  with  conditional  statements  where 
the  guards  as  well  as  the  array  index  expression  are  affine  expressions  of  the  loop  indices. 


Organization  of  the  Paper  To  make  this  paper  self-contained,  we  describe  the  notations 
and  terminologies  of  the  basic  concepts  relating  to  data  dependence  and  loop  transformation 
in  Section  2.  We  then  present  a  formal  mathematical  framework  which  unifies  the  existing 
loop  transformation  techniques,  and  sets  the  stage  for  discussing  the  more  general  classes  of 
loop  transformers  in  Section  3.  A  loop  transformer  is  a  function  that  relates  a  given  loop  nest 
with  its  transformed  version,  and  consists  of  two  parts:  a  spatial  morphism,  and  a  temporal 
morphism,  called  a  schedule.  Next,  in  Section  4,  we  classify  schedules  by  the  properties  of 
uniformity  and  the  degree  of  parallelism  to  be  gained,  and  describe  the  functional  forms  of 
the  schedules  for  each  class.  Existing  loop  transformation  techniques  are  given  as  examples 
of  these  classes  of  schedules. 

In  Section  5,  we  review  Quinton’s  algorithm  for  obtaining  single-sequential  level  uniform 
schedules  and  present  the  problem  formulations  for  two  new  classes  of  schedules,  namely, 
subdomain  schedules  and  statement-variant  schedules  and  the  algorithms  to  generate  them. 
The  generation  of  subdomain  schedules  requires  non-linear  programming,  and  an  alternative 
heuristic  algorithm  using  linear  programming  is  given. 

Section  6  describes  an  iterative  algorithm  to  obtain  multiple-sequential  level  schedules 
based  on  the  algorithms  for  single-sequential  level  schedules.  Section  7  presents  a  recursive 
algorithm  to  generate  mixed  schedules  that  result  in  imperfectly  nested  loops,  again  using 
the  algorithms  for  single-sequential  level  schedules  as  the  basic  step. 

Finally,  we  illustrate  the  usefulness  of  the  new  loop  transformation  techniques  with  an 
example  program  in  Section  8.  Versions  of  the  transformed  program  using  different  schedules 
are  implemented  on  a  Connection  Machine  CM/2.  The  difference  in  performance,  which  is 
essentially  due  to  the  available  parallelism  determined  by  the  schedule,  can  amount  to  two 
orders  of  magnitude. 


Previous  Work  Numerous  techniques  such  as  statement  reordering,  loop  vectorization, 
interchanging,  permutation  and  skewing  used  in  restructuring  compilers  [1,  2,  3,  4,  5, 6,  7,  8, 
9,  20,  21,  33,  34,  37],  have  been  proven  effective  in  gaining  parallelism  for  vector  computers 
and  small-scale  shared  memory  parallel  machines. 

Much  work  in  the  area  of  mapping  recurrence  equations  to  systolic  architectures  [11,  12, 
15,  17,  19,  24,  26,  25,  27,  28,  29,  30],  in  contrast,  focuses  on  developing  algorithms  for  loop 
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skewing. 

Guerra  [14]  and  Lin  [22]  discussed  different  transformation  functions  over  different  sub- 
domains  of  the  the  iteration  space  of  a  loop  nest  very  similar  to  the  subdomain  schedules 
presented  here.  But  they  have  not  described  any  algorithms  for  generating  such  schedules. 

Delosme  [12]  and  Rao  [31]  discussed  different  transformation  functions  over  different 
subsets  of  the  set  of  loop  statements.  To  date,  there  has  not  been  any  automated  procedure 
for  doing  such  a  transformation. 

In  all  previous  work  on  loop  transformation,  dependence  vectors  and  dependence  direc¬ 
tion  vectors  are  all  that  are  needed.  And  for  the  type  of  loop  nests  of  interest,  there  are 
constant  numbers  of  such  vectors.  In  order  to  generate  subdomain  and  statement-variant 
schedules,  we  need  actually  to  capture  the  dependence  index  pair  where  a  dependence  rela¬ 
tion  occurs.  The  problem  is  that  there  are  many  such  pairs  that  need  to  be  considered,  and 
they  can  be  infinitely  many  when  the  loop  bounds  are  unknown  at  compile  time.  We  need 
to  rely  on  a  technique  called  polyhedra  decomposition  [13,  30,  32]  to  manage  the  complexity 
of  the  algorithm. 

Some  recent  work  attempting  to  formalize  loop  transformations  requires  the  transfor¬ 
mation  functions  to  be  unimodular  [7,  8,  33].  We  will  show  that  this  requirement  is  not 
essential,  and  allow  a  much  more  general  class  of  functions  to  be  used  as  a  loop  transformer. 


2  Definitions  and  Terminologies 

Throughout  this  paper,  programming  examples  are  written  in  a  Fortran-like  notation  al¬ 
though  the  transformation  techniques  also  apply  to  functional  languages. 

Index  Domains  Let  [a,  6]  be  an  interval  domain  of  integers  from  a  to  b.  We  define  an 
index  domain  D  (also  called  an  iteration  space  in  [34])  of  a  d-level  perfectly  nested  loop 

Loop  Nest  1 


DO  (ii  =  Zi,ui){ 

DO  (...){ 

DO  ( id  =  ld,Ud){ 

body}  }  } 

to  be  the  Cartesian  product  [/j,  «i]  x  . . .  x  [/^ ,  uj]  of  d  interval  domains  [/*,  u*]  for  1  <  k  <  d. 

For  the  purpose  of  formulating  loop  transformations,  we  consider  D  to  be  a  subset  of 
the  d-dimensional  vector  space  over  rationals.  Throughout  the  paper,  we  let  I  =  (tj,  ...,id) 
and  J  =  (ji, . . . ,  jd).  With  the  domain  and  tuple  notations,  Loop  Nest  1  can  be  rewritten 
as  follows: 
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Loop  Nest  2 


DO  (/:!>){ 
body  } 


In  this  paper,  we  locus  on  sequential  loop  nests  which  are  perfectly  nested.  We  use  the 
following  loop  nest  as  a  generic  example  throughout  the  paper,  where  D  is  a  d  dimensional 
index  domain  and  r[a]  is  an  expression  containing  a: 

Loop  Nest  L  (Generic  Loop  Nest) 

DO  (I:D){ 

Si:  IF (P1)A(X(I))=... 

S2:  \F(P2)B(Z(I))  =  t[A(Y(I))) 

} 

Data  Dependence  We  now  define  dependence  between  statements.  Let  Si  and  S2  be 
two  statements  of  a  program.  A  flow  dependence  exists  from  Si  to  S2  if  Si  writes  data 
that  can  subsequently  be  read  by  S2.  An  anti-dependence  exists  from  Si  to  S2  if  Si  reads 
data  that  S2  can  subsequently  overwrite.  An  output  dependence  exists  from  Si  to  S2  if  Si 
writes  data  that  S2  can  subsequently  overwrite.  We  use  the  notation  Si  =>  S2  to  denote  a 
dependence  from  Si  to  S2. 


Equivalence  Classes  over  Statements  in  a  Loop  Nest  Let  “=$•”  be  the  reflexive 
and  transitive  closure  of  the  dependence  relation  “=»”  over  statements.  We  define  a  binary 
operation  over  statements  where  Si  ~  S2  if  Si  4-  S2  and  S2  =>•  Si.  Note  that  is 
an  equivalence  relation,  and  therefore,  partitions  loop  statements  into  equivalence  classes 
(called  tt  blocks  in  [34]).  The  technique  of  loop  fission  [34]  can  be  applied  to  spilt  the  loop 
nest  into  several  new  loop  nests,  one  for  each  equivalence  class. 


Dependent  versus  Independent  Blocks  By  the  definition  of  the  equivalence  relation 
a  single  statement  which  is  not  self-dependent  can  form  an  equivalence  class  on  its 
own.  This  case  must  be  distinguished  from  all  others  where  cyclic  dependences  actually 
occur.  We  call  this  special  case  of  an  equivalence  class  under  the  dependence  relation  an 
independent  block,  and  others  dependent  blocks.  For  example,  consider  the  following  loop 
nest: 
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Loop  Nest  3 


DO  (i  =  l,n)  { 

DO  (j  =  i  +  1,  n)  { 

51  :  A(i,j  -  i)  =  B(i,j  -  i  -  1) 

52  :  -B(t,  j  -  i)  =  A(t  -  1  ,j  -  i) 

53  :  C(i,j)  =  A(i,j)+  B(i-l,j  +  l)}} 

Statements  Si  and  S2  are  in  the  same  dependent  block  because  of  cyclic  flow  dependences, 
and  statement  S3  forms  an  independent  block.  Loop  fission  can  be  applied  to  split  Loop 
Nest  3  into  two  new  loop  nests: 

Loop  Nest  4 

DO  (*  =  l,n){ 

DO  (j  =  i  +  l,n){ 

51  :  A(i,j  -  i )  =  B(i,j  -  i  -  1) 

52  :  B(i,j  -  i)  =  A(i  -  1  ,j  -  i)  }  } 

DO  (t  =  l,n)  { 

DO  (j  =  t  +  l,n){ 

53  ■■  C(i,j)  =  A(i,j)  +  B(i  -  1  ,j  +  1)  }  } 

Loop  fission  can  be  used  to  separate  an  independent  block  from  other  dependent  blocks, 
and  the  new  loop  nest  consisting  of  one  independent  block  is  readily  parallelizable.  Similarly, 
loop  fission  can  be  used  to  transform  a  loop  body  containing  multiple  dependent  blocks 
into  multiple  loop  nests,  each  with  a  single  dependent  block.  We  therefore  consider  the  case 
that  Loop  Nest  L  consists  of  one  dependent  block  for  the  rest  of  the  paper.  A  loop  nest 
consisting  of  a  dependent  block  may  be  parallelized  by  several  techniques,  namely  statement 
reordering,  loop  vectorization,  interchanging  and  permutation  [7, 34].  To  determine  whether 
these  transformations  are  applicable  or  not,  the  notion  of  a  direction  vector  [34]  is  necessary. 

Direction  Vectors  Consider  Loop  Nest  L.  For  statement  52  to  compute  the  value  B(Z(J )) 
at  iteration  J,  the  value  A(Y(J))  is  needed.  If  A(T(J))  is  computed  from  statement  Si 
at  iteration  /,  i.e.  Y(J )  =  X(I),  then  we  say  S2  at  iteration  J  is  flow  dependent  on  5j  at 
iteration  I,  denoted  by  Si@I  =>  S2©J. 

For  a  dependence  5i@7  =>  S2©J ,  the  vector  (sig(t'i  -  ji), . .  .,sig(i«j  -  jj))  is  called  a 
direction  vector  from  5i  to  S2  [34],  where  sig  is  a  function  from  the  set  of  integers  to  the 
set  of  ordering  relations  “<”,  and 

2  <  0  -►  “<” 
sig(z)  =  •  z  -  0  — ►  “=” 

2  >  0  — 
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Dependence  Vectors  Loop  skewing  is  another  transformation  which  may  expose  more 
parallelism,  if  the  parallelism  gained  from  the  above-mentioned  techniques  is  insufficient. 
In  order  to  do  loop  skewing,  we  need  to  know  the  relative  positions  of  index  tuples  7  and  7 
for  each  dependence  S\@I  =>  [25,  34].  For  a  dependence  Si @7  ^  S2@J,  the  vector 

(ji  —  t‘i, . . . ,  jd  -  id)  is  called  a  dependence  vector  from  Si  to  S2  [34]. 

Since  I  and  I  are  in  the  d-dimensional  vector  space,  we  use  7+ J  to  denote  the  addition  of 
tv/o  vectors 7 and  J,  i.e.  7+7  =  (ti+ji, ..., id+jd);  and  similarly,  7-7  =  (ji—ii,...,jd—*d)- 


Notation  for  Concatenation  Since  we  will  be  using  matrix  and  vector  notations  ex¬ 
tensively,  we  define  the  notation  for  matrix  concatenation  here.  We  treat  a  row  vector  of 
length  d  as  a  degenerate  1-by-d  matrix,  a  column  vector  of  length  d  as  a  degenerate  d-by-1 
matrix,  and  a  scalar  as  a  degenerate  1-by-l  matrix. 

A  horizontal  concatenation  of  an  i-by-m  matrix  A  and  an  l-by-n  matrix  B,  denoted  by 
[ A ,  B],  is  an  /-by-(m  +  n)  matrix,  where  the  (t,  j)-th  element  of  [A,  B ]  is  equal  to  the  (i,j)-th 
element  of  A  if  j  <  m,  or  it  is  equal  to  the  ( i,j  —  m)-th  element  of  B  if  j  >  m. 


A  vertical  concatenation  of  an  m-by-l  matrix  A  and  an  n-by-l  matrix  B,  denoted  by 


A 

B 


,  is  an  (m  +  n)-by-/  matrix,  where  the  (i,j)- th  element  of 


A 

B 


is  equal  to  the 


(i,j)- th  element  of  A  if  i  <  m,  or  it  is  equal  to  the  (i  -  m,,7')-th  element  of  B  if  i  >  m. 


3  Formalizing  Loop  Transformation 

We  now  formalize  the  notion  of  loop  transformation  from  a  source  loop  nest  to  a  target 
parallel  loop  nest.  A  loop  transformer  is  a  function  defined  over  the  Cartesian  product 
of  the  iteration  space  of  the  loop  nest  and  the  set  of  statements  in  the  body  of  the  loop 
that  relates  a  given  loop  nest  with  its  transformed  version.  From  the  standpoint  of  symbolic 
transformation  of  the  program  text,  a  loop  transformer  can  be  decomposed  into  two  compo¬ 
nents:  the  first  component,  called  domain  morphism ,  defines  how  the  iteration  space  should 
be  mapped  to  a  new  one  (with  new  loop  bounds  and  possibly  new  predicates  guarding  the 
loop  body),  and  the  second  component,  called  statement  reordering  function,  defines  the 
ordering  of  the  statements  in  the  transformed  loop  nest.  The  process  of  obtaining  a  loop 
transformer,  however,  suggests  another  decomposition:  a  temporal  morphism  and  a  spatial 
morphism. 


3.1  Loop  Transformer  and  Schedule 

Kinds  of  Index  Domains  For  the  purpose  of  loop  transformation,  it  is  useful  to  indicate 
how  the  index  domain  shall  be  interpreted.  We  do  this  by  defining  kinds  of  index  domains. 
The  kind  of  an  interval  domain  D  can  be  either  spatial  or  temporal.  The  kind  of  a  product 
domain  is  the  product  of  the  kinds  of  the  component  domains.  For  example,  7?j  x  D2  is  of 
kind  temporal x spatial  if  D\  is  of  kind  temporal  and  7?2  is  of  kind  spatial.  A  single-level 
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loop  with  a  temporal  index  domain  corresponds  to  a  sequential  loop  (i.e.  DO),  while  a 
spatial  index  domain  corresponds  to  a  parallel  loop  (i.e.  DOALL). 


Lexicographical  Ordering  We  use  the  following  notations  to  denote  lexicographical 
ordering  on  elements  X  and  Y  of  an  n-dimensional  index  domain.  We  define  to  be  the 
lexicographical  ordering:  we  say  X  ■<  Y  if  there  exists  k,  1  <  k  <  n,  such  that  xi  =  yi  for 
all  1,1  <  k,  and  Xk  <  yk •  Similarly,  we  say  X  ■<  Y  if  X  ■<  Y  or  Xk  =  Vk  for  all  fc,  1  <  k  <  n. 
We  use  0  to  denote  the  zero  vector. 


Domain  Morphism  We  define  a  domain  morphism  to  be  a  bijective  function  g  from 
index  domain  D  to  index  domain  £,  denoted  by  g:D  — >  E ,  such  that  for  all  dependences 
Si @7  =*>  S2@7,  condition  g(J)  —  g(I)  V  0  holds.  In  other  words,  a  domain  morphism  will 
never  reverse  the  ordering  imposed  by  dependence  relations. 

In  this  paper,  we  restrict  the  codomain  E  of  a  domain  morphism  to  be  a  cross  product 
of  a  temporal  index  domain  E\  and  a  spatial  index  domain  E2,  i.e.  E  =  Ei  x  E2.  Under 
this  restriction,  all  parallel  loops  are  innermost  loops  in  the  transformed  loop  nest.  We 
define  <71  and  <72  to  be  two  functions: 

<71  :  D  —*  E\,  (called  a  temporal  morphism)  and  (2) 

<72  :  D  —*  E2 ■  (called  a  spatial  morphism)  (3) 

Under  domain  morphism  g,  index  7  in  the  original  loop  will  be  mapped  to  index  J  =  g(I) 
in  the  transformed  loop  nest.  Since  g  is  bijective,  it  has  a  well-defined  inverse,  denoted  by 
<7_1.  Clearly,  7  =  g~1(J).  The  following  loop  nest 

Loop  Nest  5 


DO  ((7:7)))  { 

...  A(X(I))  ...} 

will  be  transformed  into  the  following  new  loop  nest  under  domain  morphism 
g:D  -*■  Ei  x  E2 : 

Loop  Nest  6 


DO  ((J1:£1)){ 

DOALL  ((J2 :£*)){ 


...  A(X(g-' 


J 1 
h 


)))  •••}} 


where 


Ji 

h 


denotes  the  vertical  concatenation  of  two  column  vectors  J\  and  J2. 
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The  requirement  of  g  to  be  surjective  is  in  fact  not  essential.  For  any  injective  function 
g':D  — ►  E,  we  can  always  derive  a  corresponding  bijective  function  g:D  —*  {g'(I)  \  I  £  D) 
from  D  to  the  image  of  D  under  g '  [11].  Therefore,  by  allowing  the  codomain  of  a  bijective 
function  to  be  the  image  of  an  injective  function,  we  allow  a  much  more  general  class  of 
functions  to  be  used  as  domain  morphism.  For  comparison,  the  unimodular  transformations 
discussed  in  [8,  33]  are  special  classes  of  bijective  functions.  The  generality  does  require  some 
nontrivial  algebraic  manipulation  to  generate  correct  loop  bounds  and  predicates  to  guard 
the  conditional  statements  in  the  transformed  loop  nest.  An  automatic  transformation 
procedure  for  doing  this  based  on  an  equational  theory  is  described  in  [11]. 


Statement  Reordering  We  now  discuss  statement  reordering.  Let  S  denote  the  set  of 
statements  in  the  loop  body.  We  define  a  statement  reordering  to  be  a  function  h  from  the 
set  of  statements  to  the  set  of  statement  labels: 

r:S  — ►  [0,  s  —  1],  where  s  =  |<S|,  the  number  of  statements  in  S.  (4) 


Loop  Transformer  With  g  and  r  defined  above,  the  following  function  h,  called  the  loop 
transformer,  specifies  how  a  loop  nest  is  transformed: 

h:D  x  S  -*  Ex  x  E2  x  [0,s  -  1] 

h(I,S)  =  (9l(I),g2(I),r(S)).  (5) 

Schedule  Given  h  defined  above,  a  schedule  it  is  defined  to  be  a  function 

X  S  X  [0,a  —  1] 

*(I,S)=(9l(I),r(S)), 

such  that  condition  it(J,  S2)  —  *■(/,  5j)  >-  6  must  hold  for  all  dependences  5i@7  =>  S2©J 
in  the  loop  nest.  The  condition  ensures  that  the  ordering  imposed  by  dependence  relations 
is  preserved.  Clearly,  a  schedule  determines  the  sequential  execution  of  the  transformed 
parallel  loop  nest.  Note  that  by  the  definition  of  domain  morphism,  g\(J )  —  g\{I)  can  be 
equal  to  the  zero  vector,  i.e.  Sx@I  and  S2©J  can  be  computed  at  the  same  iteration  in  the 
transformed  loop  nest.  In  this  case,  statement  Si  must  be  in  front  of  statement  S2  in  the 
loop  body,  i.e.  condition  r(Si)  <  r(S2)  must  hold,  to  preserve  the  dependence  ordering. 


3.2  Overall  Procedure  to  Obtain  a  New  Loop  Nest 

Finding  a  schedule  it  is  to  understand  what  is  the  potential  parallelism  that  can  be  extracted 
from  the  source  program.  There  may  be  alternative  schedules  which  are  incomparable 
without  a  target  machine  model.  Traditional  loop  transformation  uses  an  ad  hoc  approach 
in  choosing  a  particular  schedule  out  of  several  alternative  ones.  A  systematic,  cost-driven 
approach  to  choose  alternative  schedules  is  beyond  the  scope  of  this  paper.  This  paper  gives 
algorithms  to  find  schedules  which  result  in  maximal  parallelism  of  the  innermost  loops. 
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The  so-called  strip  mining  [34]  and  tiling  [33,  36]  of  loops  are  captured  by  the  6patial 
morphism  g2.  The  choice  of  <72  >  which  depends  on  factors  such  as  memory  and  processor 
organization  and  communication  cost,  can  be  dealt  with  separately  and  is  not  included  in 
this  paper.  However,  given  a  schedule  it  =  (51,  r),  a  valid  <72  should  keep  a  loop  transformer 
h  =  (gi,g2ir)  injective.  In  this  paper,  we  use  the  following  default  spatial  morphism  for 
all  the  examples:  g2(ii,  •  •  -  ,id)  =  ( iPi , . .  • ,  tPn),  so  as  to  result  in  a  loop  transformer  h  that 
is  injective,  where  n  is  the  dimensionality  of  the  spatial  index  domain  £2,  {pi,. . . ,pn }  is  a 
subset  of  interval  domain  [1,  d],  and  p\  '  ...  <  pn. 


Overall  Procedure  To  summarize,  the  overall  procedure  to  obtain  a  new  loop  nest  is: 

1.  First  generate  a  schedule  it  —  ( g\,r )  to  maximize  the  degree  of  parallelism. 

2.  Then  determine  the  spatial  morphism  g 2  of  domain  morphism  based  on  target  machine 
characteristics  such  as  memory  and  processor  organization,  communication  cost,  etc., 
or  use  a  default  function  as  shown  above. 

3.  The  loop  transformer  is  simply  h  =  (ffi,<72,r)- 

4.  Finally  perform  symbolic  program  transformation,  given  the  source  loop  nest  and  loop 
transformer  h ,  to  obtain  the  new  loop  nest.  For  the  formal  procedure,  please  refer  to 
[11]. 

The  remainder  of  this  paper  is  devoted  to  generating  it  to  gain  large  scale  parallelism. 


4  Classes  of  Affine  and  Piece- Wise  Affine  Schedules 

We  call  a  schedule  affine  if  it  is  an  affine  function  of  the  loop  indices.  We  call  a  schedule 
piece-wise  affine  if  the  restriction  of  the  function  to  each  subdomain  of  D  and  each  subset 
of  S  is  affine.  In  the  loop  restructuring  literature,  only  affine  schedules  are  considered.  In 
this  paper,  we  consider,  in  addition,  piece-wise  affine  schedules. 

In  order  to  discuss  the  algorithms  for  generating  suitable  schedules,  we  now  classify 
them  according  to  two  properties:  (1)  the  uniformity  of  the  schedule  with  respect  to  the 
the  set  of  statements  S  and  the  index  domain  D,  and  (2)  the  degree  of  parallelism  in  the 
transformed  Loop  Nest. 


4.1  Properties  of  Schedules 

Uniformity  Let  index  domain  D  be  partitioned  into  m  disjoint  subdomains  £>*,  1  < 
k  <  m;  and  let  the  set  of  statements  S  be  partitioned  into  n  disjoint  subsets  $*,  1  <  k  < 
n.  The  general  form  of  a  piece-wise  affine  schedule  it  defined  in  Equation  (6)  consists  of 
conditional  branches,  one  for  each  pair  of  subdomain  Di  and  statement  subset  Sj,  and  an 
affine  expression  of  the  loop  indices  is  on  the  right-hand  side  of  each  branch.  We  call  a 
schedule 
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1.  uniform  if  m  =  1  and  n  =  1, 

2.  subdomain-variant  if  m  >  1  and  n  =  1,  (also  called  a  subdomain  schedule) 

3.  statement-variant  if  m  =  1  and  n  >  1,  or 

4.  nonuniform  if  m  >  1  and  n  >  1. 


Degree  of  Generated  Parallelism  As  defined  in  Equations  (3)  and  (6),  the  dimension¬ 
ality  of  the  temporal  index  domain,  indicates  the  number  of  levels  of  sequential  loops  in 
the  transformed  loop  nest.  Hence  a  schedule  it  would  generate  a  target  loop  nest  with  more 
levels  of  parallel  loops  and  thus  potentially  more  parallelism  if  E\  is  of  lower  dimensionality. 
We  call  the  dimensionality  of  Ei  the  sequential  level  of  jr.  Schedules  can  thus  be  classified 
as: 

1.  Single-sequential  level  schedule  if  E\  is  a  subset  of  the  set  of  natural  numbers  A f. 

2.  Multiple-sequential  level  schedule  if  E\  is  a  subset  of  AT",  where  n  is  a  positive  integer 
and  n  <  d,  the  dimensionality  of  the  original  loop  nest. 

3.  Mixed  schedule  if  E\  can  be  of  different  dimensions  for  each  pair  of  subdomain  Z), 
and  statement  subset  Sj.  Such  a  mixed  schedule  will  result  in  transformed  programs 
consisting  of  imperfectly  nested  loops. 

4.2  Classification  and  Functional  Form  of  Schedules 

Classification  Clearly,  the  uniformity  of  x  and  the  dimensionality  of  x  are  two  orthog¬ 
onal  properties,  except  that  a  mixed  schedule  cannot  be  uniform.  Thus  there  are  all  to¬ 
gether  eleven  (4  *  3  —  1)  classes  of  affine  and  piece- wise  affine  schedules.  The  classes  and 
their  acronyms  ranging  from  single-sequential  level  uniform  schedules  to  mixed  nonuniform 
schedules  are  given  below: 


Single- Sequential 
Level  (SSL) 

Multiple- Sequential 
Level  (MSL) 

Mixed 

Uniform  (U) 

SSL-U 

MSL-U 

Subdomain  (SD) 

SSL-SD 

MSL-SD 

Mixed-SD 

Statement- Variant  (SV) 

SSL-SV 

MSL-SV 

Mixed-SV 

Nonuniform  (NU) 

SSL-NU 

MSL-NU 

Mixed-NU 

Functional  Form  We  now  describe  the  forms  of  affine  and  piece-wise  affine  schedules  by 
using  matrix  and  vector  notations.  Let  r(S)  for  a  given  5  in  5  be  a  constant  scalar.  Let  d 
be  the  dimensionality  of  the  index  domain  of  the  source  loop  nest. 
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Uniform  Schedule: 


x(ItS)=(TI,r(S)),  I  E  D,  S  E  S,  (7) 

where  T  is  a  constant  l-by-d  matrix  and  /  is  the  sequential  level  of  the  schedule  ir. 
Subdomain  Schedule: 

’  I  e  Dt -+ (TiI'T^S)) 

jt(7, 5)  =  -  ...  >,  I  eD,S  eS,  (8) 

I  E  Dm  ^  (TmI,rm(S)) 

where  T),  1  <  i  <  m,  is  a  constant  l{-by-d  matrix  and  is  the  sequential  level  of  the  part 
of  the  schedule  defined  over  Z),. 

Statement- Variant  Schedule: 

[se51-»(Ti/,r(S))l 

IED,SES ,  (9) 

[S£Sn-+(TnI,r(S))\ 

where  T,-,  1  <  t  <  n,  is  a  constant  l,-by-d  matrx  and  /,  is  the  sequential  level  of  the  part  of 
the  schedule  defined  over  S;. 

Nonuniform  Schedule: 

(J€Di)-|...  > 

[(SESn)-+(TlnI,rl(S))\ 

tt(/,5)='{  ..  I  £  D,S  E  S,  (10) 

{(S  E  S\)  -*  (TmiI,rm(S)) 

... 

(5  E  Sn)  — *  (Tmn7,rm(S)) 

where  Tih  1  <  i  <  m  and  1  <  j  <  n,  is  a  constant  Uj-by-d  matrix  and  is  the  sequential 
level  of  the  part  of  the  schedule  defined  over  D,  and  Sj. 

The  linear  term  TI,  I  £  D,  determines  the  form  of  the  sequential  loops  in  the  trans¬ 
formed  loop  nest,  which  includes  nesting  structures,  bounds,  and  possibly  additional  pred¬ 
icates  to  guard  the  loop  body.  The  constant  terms  r(S)  determine  the  orders  of  the  state¬ 
ments  in  the  transformed  loop  body. 
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4.3  Examples  of  Different  Classes  of  Schedules 


We  now  give  some  examples  of  different  classes  of  schedules.  We  first  show  that  loop  vec- 
torization,  interchanging,  permutation  and  skewing  are  special  cases  of  multiple-sequential 
uniform  schedules. 


Example  1:  Loop  Vectorization  Let  loc  be  a  function  from  S  to  Af  that  returns  the 
position  of  the  statement  5  in  the  source  loop  nest.  Suppose  a  dependence  test  says  that 
m  innermost  loops  can  be  parallelized.  The  schedule  for  the  so-called  loop  vectorization  of 
the  d  —  m  outermost  loops  is  of  the  following  form,  where  d  is  the  dimensionality  of  the 
index  domain  of  the  loop  nest: 


*•(/,  5)  =  (i|,  *2,  •  •  • ,  id-m,  loc(S)), 
1  V(l)  \ 


i.e.  T  = 

\v(d-m)) 

r(S)  =  loc(S), 


,  and 


(11) 

(12) 

(13) 


where  each  V(k)  is  a  vector  of  length  d  with  fc-th  element  being  1  and  all  other  elements 
being  0. 


Example  2:  Loop  Interchanging  and  Permutation  Loop  interchanging  and  loop 
permutation  [1,  2,  3,  7,  34,  35,  37]  is  a  process  of  switching  inner  and  outer  loops.  Suppose 
Loop  Nest  1  after  loop  interchanging  or  loop  permutation  becomes 


DO  (*pi  —  >  upi  )  { 

DO  (...){ 

DO  (*p<*  =  ^Pd>uPd)i 

body  }  }  }, 

where  (P\,pi, .  ■  -,Pd)  is  a  permutation  of  (1,2,.  ..,d).  Also  suppose  the  m  innermost  loops 
are  parallelizable.  The  schedule  n  has  the  form: 


—  (ip j  ,  ip2 ,  •  •  •  »  i°C(^))> 


1  V(Pl)  \ 


i.e.  T  = 


,  and 


\V(Pd-m)J 

r(S)  =  loc(5), 


(14) 

(15) 

(16) 


where  each  V(k)  is  the  same  as  defined  :n  Example  1. 
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Example  3:  Loop  Skewing  This  operation  transforms  Loop  Nest  1  as  follows:  shifting 
index  i„  with  respect  to  index  im,  1  <  m  <  n  <  d,  by  a  factor  of  /,  where  /  is  a  positive 
integer,  replacing  ln  with  the  expression  (1„  +  im  *  /),  replacing  u„  with  the  expression 
(un  +  *  /),  and  replacing  all  occurrences  of  i„  in  the  loop  with  the  expression  ( in  -  im  *  f) 

[34,  37].  The  transformed  loop  nest  is  of  the  form: 

Loop  Nest  7 

DO  (ti  =  Jx.tuM 


DO  (l„  —  /n  4  im  *  /)  4"  *m  *  /)  { 


DO  (id  =  ld,Ud)  { 

same  loop  body  but  with  in  being  replaced  by  (t„  —  tm  *  /)  }  }  } 

The  schedule  for  such  so  called  loop  skewing  is  of  the  form: 

7r(/ ,  S )  =  (t'i ,  •  •  • ,  imi  •  •  • ,  jn  4  /  *  im  » •  •  •  >  i<ti  loc(S)), 

n-th  element 
/  V(l)  \ 


(17) 


i.e.  r  = 


V(n)  4  /  *  V(m) 


,  and 


\  V(d)  ) 
r(5)  =  loc(5), 

where  each  V(k)  is  the  same  as  defined  in  Example  1. 


(18) 


(19) 


Example  4:  Single-Sequential  Level  Uniform  Schedule 
Loop  Nest  8 

DO  (t  =  l,n)  { 

DO  (j  =  l,n)  { 

S\  :  A(i,j)=  B(i,j  -  l)4t 
S2:  B(i,j)  =  A(i  —  1,  j)  4  i  }  } 

A  single-sequential  level  uniform  schedule 

=  (t',1),  and  (20) 

*({i,j,k),S2)  =  (i,  0),  (21) 

will  transform  Loop  Nest  8  into 
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Loop  Nest  9 


DO  (t  =  l,n)  { 

DOALL  (j  =  l,n){ 

52  :  =  A(i-l,j)  +  j 

5i  :  A(i,j)  =  B(i,j  -  1)  +  i  }  } 


Example  5:  Multiple-Sequential  Level  Uniform  Schedule 

Loop  Nest  10 

DO  (*  =  n  -  1,1, -1){ 

DO  (j  =  i  +  l,n){ 

DO  (k  =  i,j){ 

51  :  IF(i+  1  =  k)  B(i,j,k)  =  C(i  +  1  ,j,j) 

52  :  IF(i  4-  1  <  k)  B(i,j,  k)  =  B(i  +  1  ,j,  k) 

53  :  IF(t  +  j  +  1  <  2 k)C(i,j,k)  =  C(i,j,k-  1)  +  B(i,j,k)  }  }  } 

A  two-sequential  level  uniform  schedule 

*((*\ h  k),  S)  =  ((- i ,  k ),  loc(5))  (22) 

will  transform  Loop  Nest  10  into 

Loop  Nest  11 

DO  (t  =  1  -  n,  —1)  { 

DO  ( k  =  —  i,  n )  { 

DOALL  (j  =  1  -  i,  n)  { 

5X  :  IF((— i+  1  =  k)  A(k  <  j))  B(-i,j,k)  =  C(1  -  i,j,j ) 

52  :  IF((— i  +  1  <  k)A(k  <  j))  B(-i,j,k)  =  B{  1  -  i,j,k ) 

53  •  IF((  — t  -F  j  -F  1  <  2k)  A  (k  <  j )) 

C(-i,j,k)  =  C(-i,j,  k  —  1  )  +  B{-i,j,k)  }  }  } 


Example  6:  Mixed  Statement- Variant  Schedule  Consider  Loop  Nest  10  again.  The 
following  schedule 


x((i,j,k),S)  = 


=  /  s  =  S3^((-t,*),|OC(S))l 
else  — ►  (-t,loc(5))  J 


(23) 


transforms  Loop  Nest  10  to  Loop  Nest  12  below,  which  consists  of  imperfectly  nested  loops: 
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Loop  Nest  12 


DO  (i  =  1  —  n,  —1)  { 

DOALL  ((j  =  1  -  i,  n),  (A:  =  -i,  n))  { 

51  :  IF((— i  +  1  =  k)  A  (*  <  j ))  B(-i,j,k)  =  C(  1  -  1,7,7) 

52  :  IF((— i  +  1  <  A:)  A  (k  <  j))  B(-i,j,k )  =  B(  1  -  i,j,k )  } 
DO  (A:  =  —  i,  n )  { 

DOALL  ( j  =  1  -i,n){ 

S3  :  IF(( — t  +  7  +  1  <  2k)  A  (A:  <  7')) 

C(-iJ,k)  =  C(-iJ,k-  1)  +  B(—i,j,k)  }  }  } 


Example  7:  Single-Sequential  Level  Subdomain- Variant  Schedule  Another  pos¬ 
sible  transformation  of  Loop  Nest  10  is  the  schedule: 


A;),  S) 


i  +  j  -2k  <0 
i  +  j  -  2k  >  0 


(-2i  +  j  +  A:,loc(5))l 
(— 1  +  2j  —  k,  loc(5))  J  ’ 


(24) 


which  transforms  Loop  Nest  10  into: 


Loop  Nest  13 

DO  ( t  =  2,2n-  2){ 

DOALL  (i  =  n  -  1,1,  — 1)  { 

DOALL  (j  =  t  +  1,  n)  { 

Su  :  IF((2 1  +  3t  -  3j  >  0)  A  (t  +  i  -  j  -  1  =  0)) 

B(i,j,t  +  2 i  -  j )  =  C(i  +  1  ,j,j) 

Su  :  IF((2 1  +  3 i-  3 j  >  0)  A  (i  +  2i  -  2 j  +  1  =  0)) 

-t-i  +  2 j)  =  C(i  +  1  ,j,j) 

521  :  IF((2 1  +  3t  -  Zj  >  0)  A  (t  +  i  -  j  -  1  >  0)) 

J9(i,  j,  t  +  2i-j)  =  B(i  +  !,>,<  +  2j  -  7) 

522  :  IF((2t  +  3i  -  3j  >  0)  A  (t  +  2t  -  2j  +  1  <  0)) 

B(i,  j,  —t  —  i  +  2 j)  =  B(i  +  l,j, -t  -  2i  +  2j) 

53 1  :  I F((2£  +  3i  -  3 7  >  0)  A  (21  +  3i  -  Zj  -  1  >  0)) 

C(i,j,t  +  2 i  -  j)  =  C(i,j,t  +  2i  -  7  -  1)  +  B(i,j,t  +  2t  -  7) 

532  :  IF((2<  +  3t  -  37  >  0)  A  (21  +  3i  -  37  +  1  <  0)) 

C(i,j,  —t  —  i  +  2j)  =  C(i,j,  -t-i  +  2j-l)  +  B(i,j ,  -1,  -i  +  27)  }  }  } 


Since  there  are  two  affine  functions  for  disjoint  subdomains  of  the  index  domain  of  the  loop 
nest,  each  statement  in  Loop  Nest  10  results  in  two  guarded  statements  in  the  transformed 


loop  nest.  In  fact  Loop  Nest  10  is  part  of  the  dynamic  programming  code  presented  in 
Section  8.  As  one  can  see,  an  SSL-SD  schedule  can  result  in  code  of  considerable  complexity. 
It  would  be  a  very  tedious  and  error-prone  process  for  a  user  to  write  the  code  by  hand. 
But  a  compiler  can  generate  the  new  loop  nest,  given  the  schedule,  and  the  original  loop 
nest  mechanically. 


5  Algorithms  for  Generating  Single-Sequential  Level  Sched¬ 
ules 

Our  algorithms  for  generating  the  various  single-sequential  level  schedules  are  based  on 
Quinton’s  algorithm  for  generating  SSL-U  schedules  [13,  29,  30].  To  make  the  paper  self- 
contained,  we  first  review  Quinton’s  algorithm.  We  then  present  two  algorithms,  one  for 
generating  SSL-SD  schedules  and  the  other  for  SSL-SV  schedules. 

5.1  Previous  Work:  Uniform  Scheduling  Algorithm 

Quinton’s  approach  addresses  the  analysis  and  mapping  of  linear  recurrence  equations  [13, 
29,  30].  We  formulate  Quinton’s  algorithm  in  the  context  of  loop  transformations. 

5.1.1  Problem  Formulation 

Constraints  Derived  from  Data  Dependences  For  an  SSL-U  schedule 
7 r(/,  5)  =  ( TI ,  r(S)),  where  T  is  a  row  vector,  the  inequality 

(TJ,  r(S2))- (17,  r(5,))>- 6,  (25) 

must  hold  for  all  index  tuples  I  and  J  in  index  domain  D  and  statements  S\  and  5 2  in 
S  such  that  S\@I  =►  S2@J.  We  first  focus  on  the  problem  of  obtaining  T  satisfying  the 
following  more  stringent  condition: 

TJ-TI  =  T(J  -  I)  >  0  (26) 

for  each  dependence  Si @7  =>  S2©J.  If  such  T  exists,  then  Equation  (25)  also  holds  for  all 
dependences  S\©I  =>  S2@J  due  to  the  lexicographical  ordering  “V”  regardless  of  what  r  is. 
In  this  case,  the  ordering  among  statements  can  be  arbitrary.  How  to  obtain  T  satisfying 
less  stringent  conditions,  and  use  r,  in  addition  to  T,  to  preserve  the  ordering  imposed  by 
dependences,  will  be  discussed  in  Section  5.2. 


Space  of  Dependence  Vectors  It  is  clear  that  in  the  case  of  uniform  schedule,  depen¬ 
dence  vectors  J  —  I  are  sufficient  for  obtaining  T.  We  now  formulate  the  set  of  dependence 
vectors  for  conditional  statements. 
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Let  D  be  an  index  domain  and  P  be  the  predicate  in  a  conditional  statement  S.  We 
define  the  index  domain  of  statement  S  to  be  D  under  the  restriction  of  P,  denoted  by  D[P: 

D[P  =  {/  |  /  6  D,and  P(I )  is  true}.  (27) 

For  the  rest  of  the  paper,  we  consider  only  flow  dependence  in  Loop  Nest  L.  Anti-dependence 
and  output  dependence  can  be  treated  similarly.  In  this  case,  the  set  of  dependence  vectors 
from  statement  S\  to  S2  in  Loop  Nest  L,  denoted  by  V(5i,Sa),  is  [23]: 

V(S1,S2)  =  {J-I\I6(DIP1),J6(DIP2),  and  X(I)  =  Y(J)}.  (28) 

Input  System  As  described  before,  each  dependence  relation  5i@I  =>  S2@J  defines  an 
inequality  T(J  —  I)  >  0  that  row  vector  T  must  satisfy.  We  call  the  set  of  all  constraints 
on  T  the  input  system,  denoted  by  C: 

C  =  {T(J  —  /)  >  0  ]  there  exist  statements  Si  and  S2  in  S,  and  index  tuples 
I  and  J  in  D ,  such  that  Si©I  = »  S2@J} 

=  {T(J  —  I)  >  0  |  there  exist  statements  Si  and  S2  in  S,  such  that 

(J-/)6V(51,52)>. 


Polyhedra  and  Polytopes  There  can  be  many  dependence  vectors  that  need  to  be 
considered,  and  they  can  be  infinitely  many  when  the  loop  bounds  are  unknown  at  compile 
time.  We  need  to  rely  on  a  technique  that  decomposes  a  polyhedron  into  vertices  and 
extremal  rays  [13,  30,  32]  to  manage  the  complexity  of  the  algorithm.  We  first  define  what 
a  polyhedron  is. 

Let  A  be  a  c-by-d  matrix  and  I?  be  a  column  vector  of  length  d.  Let  Z  be  the  set  of 
rationals.  A  polyhedron  is  a  subspace  of  the  d-dimensional  vector  space  Zd  that  can  be 
expressed  as  {/  |  AI  >  B}  [32].  A  bounded  polyhedron  is  called  a  polytope  [32]. 

Constraints  on  the  Input  System  In  the  loop  restructuring  literature,  only  rectangular 
and  trapezoid  index  domains  of  loop  nests  are  considered  [6,  34].  Note  that  rectangular  and 
trapezoid  index  domains  are  special  classes  of  polyhedra  [23].  In  order  to  obtain  uniform 
schedules  systematically,  Quinton  restricts  the  index  domain  of  each  recurrence  equation 
to  be  a  polyhedron  and  all  subscript  functions  (called  index  mappings  in  [30])  to  be  affine 
expressions  of  the  indices  used  in  defining  the  index  domains.  In  Fortran  like  programs, 
the  above  restriction  is  translated  to  perfectly  nested  loops  where  loop  bounds  at  one  level 
may  depend  on  the  outer  levels,  with  the  loop  body  consisting  of  conditional  statements 
where  all  the  predicates  of  conditionals  and  all  subscript  functions  are  affine  expressions  of 
the  loop  indices.  Under  these  restrictions,  Quinton  shows  that  [30]  the  set  V(5j,52)  of  the 
dependence  vectors  from  statement  Si  to  52  is  a  d-dimensional  polyhedron,  where  d  is  the 
dimensionality  of  the  index  domain  of  the  loop  nest. 
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A 


Figure  1:  Point  (1,0)  is  the  vertex  and  two  vectors  (1,2)  and  (2,1)  are  the  extremal  rays. 


5.1.2  Decomposing  a  Polyhedron  into  Vertices  and  Extremal  Rays 

We  now  discuss  how  to  represent  polyhedron  V(Si,  S2)  by  its  vertices  and  extremal  rays. 


Vertices  and  Extremal  Rays  Any  polyhedron  can  be  decomposed  into  a  finite  set 
of  vertices  and  extremal  rays  [32].  (Since  a  line  can  be  interpreted  as  two  rays  in  opposite 
directions  [32],  vertices  and  extremal  rays  are  sufficient  for  polyhedra  decomposition.)  Here, 
we  use  the  following  example  to  show  what  vertices  and  extremal  rays  are.  For  formal 
definitions,  please  refer  to  [32].  Consider  the  polyhedron  Pe  specified  by  two  inequalities: 
Pe  =  {(x,y)  |  2x  —  y  >  2,  and  2y  —  x  >  —1}.  Point  (1,0)  is  the  vertex  and  two  vectors 
(1,2)  and  (2, 1)  are  the  extremal  rays  of  Pe  as  shown  below: 


Algorithm  for  Polyhedron  Decomposition  Since  the  algorithm  for  obtaining  subdo¬ 
main  schedules  (to  be  discussed  in  Section  5.3)  relies  on  the  algorithm  for  decomposing  a 
polyhedron  into  vertices  and  rays,  we  now  review  the  decomposition  algorithm  presented  in 
[13]. 


Let  Q  be  a  polyhedron  defined  by  Q  =  {K  \  AK  >  B},  where  A  is  a  c-by-d  matrix 
and  B  is  a  column  vector  of  length  c  (d  is  the  dimensionality  of  index  domain  D  and 
c  is  the  number  of  constraints).  If  B  is  a  zero  vector,  then  we  call  Q  a  homogeneous 
polyhedron,  otherwise,  a  nonhomogeneous  polyhedron.  A  nonhomogeneous  polyhedron  has 

a  corresponding  homogeneous  polyhedron.  Let  h  be  a  scalar  variable.  Let  Qh  =  { 


[A,-B] 


K 

h 


>  0,  h  >  0},  where 


K 

h 


is  the  vertical  concatenation  of  column  vector  K 


and  scalar  h,  and  [A,-B]  is  the  horizontal  concatenation  of  matrix  A  and  column  vector 


17 


B.  Then  Qh  defines  the  corresponding  homogeneous  polyhedron  of  Q.  A  ray 


oiQh 


R 
h 

implies  that  R  is  a  ray  of  Q  if  h  =  0,  or  ^  is  a  vertex  of  Q  if  h  >  0.  Therefore,  it  suffices 
to  compute  the  extremal  rays  of  a  homogeneous  polyhedron,  and  then  obtain  the  vertices 
and  extremal  rays  for  the  corresponding  nonhomogeneous  polyhedron. 


Algorithm  ER  (Obtaining  Extremal  Rays  of  a  Homogeneous  Polyhedron) 

The  algorithm  starts  with  an  initial  set  71  of  rays  which  can  be  easily  generated.  A  succession 
of  transformations  on  1Z  are  then  performed,  one  for  each  constraint  in  the  system.  The 
ordering  in  which  the  constraints  are  chosen  does  not  affect  the  correctness  of  the  result.  In 
each  transformation,  new  V.  is  computed  according  to  the  projections  of  the  current  rays  in 

r* 


71  on  C  as  described  below,  where  C 


K 

h 


>  0  is  the  current  selected  constraint: 


1.  1Z0  «-  {R  |  R  6  7Z,CR  =  0}; 

2.  7Z+  *-  {R  |  R  €  7Z,CR  >  0}; 

3.  71-  «-  {R  |  R  e  71, CR  <  0}; 

4.  TZ  *—  TZq  U  7Z+  U  +  O2-R2)  |  Ri  G  71+,  R2  €  TZ— ,  o\  >  0, 02  >  0, 

C(aifZi  +  =  0}- 


5.1.3  Linear  Programming  Formulation 

After  polyhedron  decomposition,  each  point  in  polyhedron  V(5i,  S2)  can  be  expressed  as  the 
sum  of  a  convex  combination  of  the  vertices  and  of  a  positive  combination  of  the  extremal 
rays  of  V(Si,S2)  [32].  Based  on  this  property,  Quinton  shows  that  [29,  30]  Equation  (26) 
holds  for  all  dependence  vectors  J  —  I  in  V(5i,  52)  if  and  only  if  the  following  two  conditions 
hold: 


for  all  vertex  V  of  V(5lt  52),  TV  >  0,  and 
for  all  extremal  ray  R  of  V(Si,  S2),  TR  >  0. 

By  Quinton’s  theorem,  the  input  system  C  defined  in  Equation  (29)  can  be  simplified  to: 


{TV  >  0  |  for  all  vertex  V  of  V(5j,  $2)*  where  S 1  and  S2 

are  two  statements  in  S  such  that  Si  =>  52)  , 

(31) 

U  {TR  >  0  |  for  all  extremal  ray  R  of  V(Si,S2),  where  Si  and  S2 
are  two  statements  in  S  such  that  Si  ^  S2}. 

Consequently,  the  row  vector  T  of  length  d,  which  defines  an  SSL-U  schedule  of  a  loop  nest, 
can  be  obtained  by  linear  programming,  where  d  is  the  dimensionality  of  the  index  domain 
of  the  loop  nest.  The  dimensionality  of  the  linear  programming  system  is  d,  and  the  number 
of  constraints  is  the  sum  of  the  number  of  vertices  and  extremal  rays. 


18 


5.2  An  Algorithm  for  Statement  Reordering 

Having  reviewed  the  algorithm  for  generating  an  SSL-U  schedule  with  arbitrary  statement 
reordering  function  r,  we  now  discuss  how  to  obtain  an  SSL-U  schedule  with  statement 
reordering  terms:  n(I ,  S)  =  (gi(I),  r(S)),  where  gi  is  the  temporal  morphism  defined  in 
Equation  (3),  and  r  is  the  statement  reordering  function  defined  in  Equation  (4).  The 
method  can  be  modified  easily  to  obtain  the  statement  reordering  functions  for  other  single- 
sequential  level  schedules,  i.e.  SSL-SD,  SSL-SV  and  SSL-NU. 

The  original  input  system  for  obtaining  a  schedule  tt(J,  5)  =  (<?i(7),  r(S))  should  be: 

C  =  {{gi(J),r(S2))  y  (gi(I),r(Si))  |  there  exist  statements  Si  and  $2  in  5, 

and  index  tuples  7  and  J  in  D,  such  that  Si @7  =>  S2@J}. 

To  solve  for  gi  and  r  separately  in  two  steps,  the  formulation  is  developed  as  follows. 
We  start  with  the  notion  of  minimal  target  difference  vectors. 


Minimal  Target  Difference  Vector  We  define  T  to  be  the  function  space  [D  — ►  JE^], 
where  E\  is  a  one-dimensional  temporal  index  domain,  and  define  Z  to  be  the  set  of  ra¬ 
tional.  We  define  a  second  order  function  p(/,  Si,S2)  to  be  the  minimal  target  difference 
vector  (in  the  sense  of  lexicographical  ordering  on  elements  of  E\)  ranging  over  the  image 
of  the  set  of  dependence  vectors  V(Si,S2)  defined  in  Equation  (28)  under  function  /  €  T: 

H  :  T  xSxS  ->  Z  (33) 

M(/, SuS2)  =  m\n{f(K)  \  K  E  V(S,,  S2)}.  (34) 

It  is  easy  to  see  that  condition  n{gi,Si,S2)  >  0  holds  if  and  only  if  gi(J  -  7)  >  0  holds 

for  all  (J- 7)  e  V(Si,S2). 


Input  Systems  for  gi  and  r  Due  to  the  lexicographical  ordering  >-,  condition 
(5i(J),r(52))  >-  (ffi(7),r(Si))  in  Equation  (32)  holds  for  all  (7  -  7)  G  V(Si,S2)  if  and 
only  if  either  p(ffi,Si,S2)  >  0  holds  or  both  /x(ffi,5i,52)  =  0  and  r(S2)  >  r(Si)  hold. 
Consequently,  as  far  as  the  statement  reordering  function  r  is  concerned,  the  dependences 
5 1  =>  S2  where  n(gi,  Sl5  52)  >  0  do  not  provide  any  constraint  on  r.  Only  for  those 
dependences  Si  =>  S2  such  that  n{gi,  Si,  52)  =  0,  the  conditions  r(Si)  <  r(S2)  must  hold. 
So  the  algorithm  for  generating  SSL  schedules  can  start  out  with  a  less  stringent  criterion 
li(ShSuS2)  >  0  for  all  pairs  of  statements  Sj  and  S2  in  5  such  that  St  =>  52  to  find  git 
and  follow  by  the  criterion  r(Si)  <  r(Si)  for  those  dependence  relation  S\  =>  S2  where 
v(9i,Si,S2)  =  0. 

From  the  above  discussion,  we  know  that  the  input  system  defined  in  Equation  (32) 
can  be  separated  into  two  parts,  one  for  temporal  morphism  g\  and  the  other  for  statement 
reordering  function  r: 


for  gx 


C g ,  =  {n(gi,  Si,  S2)  >  0  j  there  exist  statements  S\  and  52  in  5, 

such  that  Si  =»  S2}, 


(35) 
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(36) 


CT  =  {r(Si)  <  r(Si)  |  there  exist  statements  Si  and  S2  in  S, 

for  t  : 

such  that  Si  =►  S2  and  p(jiu  S\,  S2)  =  0}. 

The  algorithm  discussed  in  Section  5.1,  and  those  to  be  discussed  in  Section  5.3,  5.4, 
and  5.5  can  be  used  to  obtain  temporal  morphism  g\  from  the  input  system  specified  in 
Equation  (35).  We  now  discuss  how  to  obtain  statement  reordering  function  r  from  the 
input  system  specified  in  Equation  (36)  given  a  temporal  morphism  g\. 

Partial  Ordering  and  Topological  Sort  To  obtain  statement  reordering  function  r 
given  a  temporal  morphism  */i ,  we  need  the  notion  of  partial  ordering  [16].  A  partial  order 
on  a  set  S  is  a  binary  relation  “<”  that  is  transitive  and  irreflexive,  i.e.  there  cannot  be  any 
cyclic  relations  x  <  . . .  <  x  for  all  elements  x  £  S.  For  the  purpose  of  statement  reordering, 
we  define  5  to  be  the  set  of  statements  S  and  we  say  Si  <  S2,  i.e.  statement  Si  must  be  in 
front  of  statement  S2  in  the  transformed  loop  body,  if  Si  =>  S2  and  p(gi,  Si,  S2)  =  0. 

If  S  has  a  partial  ordering,  then  topological  sort  can  produce  a  linear  ordering  of  the 
statements  [16],  which  defines  r.  If  S  does  not  have  a  partial  ordering,  e.g.  if  there  are 
cyclic  relations  Si  <  S2  and  S2  <  Si,  then  there  cannot  be  any  r  that  will  satisfy  both 
r(Si)  <  r(S2)  and  r(Si)  >  KS2).  In  this  case,  the  given  temporal  morphism  g\  should  be 
rejected. 

5.3  Subdomain  Scheduling  Algorithm  with  Bounded  Search  Space 

This  section  presents  an  algorithm  for  generating  SSL-SD  schedules.  The  hard  part  of 
finding  a  subdomain  schedule  for  a  given  loop  nest  is  to  determine  where  the  subdomain 
boundaries  are.  In  this  section,  we  present  an  algorithm  which  searches  through  a  bounded 
space  for  possible  hyperplanes  that  partition  the  domain,  and  comes  up  with  affine  sched¬ 
ules,  one  for  each  subdomain.  Since  the  complexity  of  this  method  is  too  high,  in  Section  5.4 
we  describe  a  heuristic  that  first  makes  guesses  at  possible  subdomain  boundaries  by  un¬ 
bounded  inside-out  enumerative  search,  and  then  obtains  a  subdomain  schedule  by  linear 
programming  with  given  subdomains. 

5.3.1  Problem  Formulation 

Constraints  Derived  from  Data  Dependence  For  an  SSL-SD  schedule  defined  in 
Equation  (8),  the  inequality 

{TjJ'rjfrV-iTilMSiVyb  (37) 

must  hold  for  all  index  tuples  I  6  D{  and  J  6  Dj,  1  <  i,j  <  m,  and  statements  Si  and  S2 
in  S  such  that  Si  @7  =>  S2@J,  where  T,  and  Tj  are  row  vectors  of  length  d.  We  focus  on 
the  problem  of  obtaining  Ti  satisfying  the  condition 

TjJ-TtI  =  [-TuT3} 
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>0, 


(38) 


which  is  the  first  step  in  obtaining  an  SSL-SD  schedule.  This  step  will  be  followed  by  a 
topological  sort  to  find  the  statement  reordering  function  as  discussed  in  Section  5.2. 

Since  row  vector  T,  can  be  different  from  Tj,  Equation  (38)  cannot  be  rewritten  as 
T(  J  —  I)  >  0.  Consequently,  dependence  vectors  are  not  adequate  for  obtaining  subdomain 
schedules.  A  new  representation  of  a  dependence  relation  is  necessary. 


Dependence  Index  pairs  We  now  introduce  a  new  notion  of  dependence  relations.  If 
dependence  Si@I  =>  S2@J  exists,  then  we  call  (/,  J)  a  dependence  index  pair  from  statement 
Si  to  S2. 


Space  of  Dependence  Index  Pairs  Similar  to  the  set  of  dependence  vectors  V(Si,  S2) 
defined  in  Equation  (28),  we  formulate  the  set  of  dependence  index  pairs  from  statement 
5j  to  $2,  denoted  by  V(Si,S2),  as: 

P(Si,S2)  =  {(I,J)  |  /  e  (DlPi),J  6  (DIP2),  and  X{I)  =  y(J)}.  (39) 

It  is  easy  to  see  that,  under  the  restrictions  discussed  in  Section  5.1,  V(S\,S2)  is  a  (2d)- 
dimensional  polyhedron,  where  d  is  the  dimensionality  of  the  index  domain  of  the  loop 
nest. 

Statement  Reordering  In  Section  5.2,  we  discuss  how  to  obtain  a  statement  reordering 
function  r  given  an  SSL-U  temporal  morphism  g\.  In  fact,  all  formulations  in  Section  5.2 
also  hold  for  other  classes  of  <71,  i.e.  SSL-SD,  SSL-SV  and  SSL-NU,  except  that  we  need 
to  use  the  set  of  dependence  index  pairs  V{Si,S2)  to  replace  the  set  of  dependence  vectors 
V(Si,  S2)  in  the  formulation. 

Let  T,  Ei,  Z  be  as  defined  in  Section  5.2.  We  define  a  second-order  function  p(f,  5j ,  S2) 
to  be  the  minimal  target  difference  vector  ranging  over  the  vectors  f(J)  —  /(/)  where 
(I,  J)  €  T*(5i,  S2)  and  /  €  T: 

M  St,  S2)  =  min  {/(J)  -  /(/)  |  (I,  J )  €  V(St,  S2)}.  (40) 


Input  System  The  input  system  for  obtaining  an  SSL-SD  temporal  morphism  g\  consists 
of  the  following  constraints: 


C  =  {[-Tt,Tj\ 


>  0  |  there  exist  statements  S\  and  S2  in  S,  such  that 


(I,J)  €  Q{St,S2,i,j)},  where 
Q(St,S2,i,j)=  i(T,J)\  (I,J)€V(Si,S2),I<E  d{,j  e  Dj}. 


(41) 


Let  B  be  a  row  vector  of  length  d  and  c  be  a  scalar.  Let  Z  be  the  set  of  rationals.  A 
hyperplane  is  a  subspace  of  the  d-dimensional  vector  space  Zd  that  can  be  expressed  as 
{/  |  BI  =  c}.  In  order  to  obtain  subdomain  schedules  systematically,  we  restrict  that 
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the  subdomains  are  separated  by  hyperplanes.  Under  this  restriction,  each  subdomain  Dt, 
1  <  t  <  m,  is  a  d-dimensional  polyhedron,  and  each  Q(Si,  S2,  i,j)  defined  in  Equation  (41) 
is  a  (2d)-dimensional  polyhedron. 

Again,  we  need  to  represent  each  polyhedron  <2(Sj,  S2,  i,j)  by  its  vertices  and  extremal 
rays.  However,  since  polyhedron  Q(Sx,S2,t,j )  defined  in  Equation  (41)  is  specified  by  two 
unknown  subdomains  D,  and  D3,  Algorithm  ER  of  Section  5.1  is  not  directly  applicable. 
So  the  key  point  is  to  do  polyhedra  decomposition  under  unknown  constraints.  Then 
the  input  system  defined  in  Equation  (41)  can  be  simplified  to  another  system  with  all 
dependence  index  pairs  being  replaced  by  vertices  and  extremal  rays  as  discussed  before. 
In  the  following,  we  first  use  a  very  simple  example  to  show  the  basic  idea  of  doing  this. 
We  then  present  the  formal  solution. 


5.3.2  Basic  Idea 

We  take  the  following  loop  whose  index  domain  D  is  a  one-dimensional  interval  domain 
[/,  u]  as  an  example: 

Loop  Nest  14 


DO  (i  =  /,«){ 

Si:  IF(te(/i,«il)A(10)  =  ... 

S2:  IF(»G[/2,tt2])  ...  =  A(10) 

} 

where  /,  u,  /1,  t*i,  /2  and  u2  are  integer  constants  or  variables  under  the  conditions  [/1,  ui]  C 
[l,u]  and  [/2,u2]  C  [/,«].  For  this  example,  V(SX,S2),  defined  in  Equation  (39),  is  a  two- 
dimensional  polyhedron: 

V(SUS2)  =  {( i,j )  |  i  G  [/i,«i],J  €  [/2,«2],10  =  10},  (42) 

which  contains  four  vertices  (lx,l2),  (lx,u2),  (u i,/2),  and  («i,u2).  Let  D\  and  D2 

Dx  =  Di(i  >  c)  and  (43) 

Dx  =  Di(i  <  c)  (44) 

be  two  subdomains  of  D  separated  by  the  hyperplane  i  =  c,  where  c  is  an  unknown  scalar. 
Under  Dx  and  D2,  polyhedron  V(SX,S2)  is  partitioned  into  four  disjoint  parts  Q(SX,  S2, 1, 1), 
Q(SX,  52, 1,2),  <5(5i,52,2,1),  and  Q(Si,S2,2,2)  as  defined  in  Equation  (41),  where 

Q(5i,S2,1,1)  =  {(i,j)  |  i  G  [/i,«i],j  G  [/2,u2],i  >  c,j  >  c},  (45) 

and  others  have  similar  formulation.  In  the  following  we  let  Qu  =  Q(SX,  S2, 1,1). 
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Figure  2:  Bounded  Search  Space 

Bounded  Search  One  way  to  find  the  vertices  of  Qn  is  to  try  all  possible  c  values.  For 
each  c,  the  vertices  of  Qn  can  be  obtained  by  Algorithm  ER  of  Section  5.1.  However,  there 
can  be  too  many  c  to  be  searched,  or  even  infinitely  many  if  the  loop  bounds  are  unknown 
at  compile  time.  In  fact,  for  any  c  /alue,  a  vertex  (x,  y )  of  V(S j,  S2)  implies  that 

1.  (x,  y)  is  a  vertex  of  Qn  if  x  >  c  and  y  >  c, 

2.  (x,c)  is  a  vertex  of  Qn  if  x  >  c  and  y  <  c  <  u2,  as  shown  if  Figure  2(a), 

3.  (c,  y)  is  a  vertex  of  Qn  if  x  <  c  <  tii  and  y  >  c,  as  shown  if  Figure  2(b), 

4.  (c,  c)  is  a  vertex  of  Qu  if  x  <  c  <  uj  and  y  <  c  <  u2.  as  shown  if  Figure  2(c),  or 

5.  Qn  has  no  vertex  if  c  >  ui  or  c  >  u2. 

Therefore,  based  on  the  relative  values  of  x,  y  and  c,  there  are  four  possible  vertices  of  Q 
from  a  vertex  ( x,y )  of  V(Si,  S2).  Note  that  three  of  the  four  possible  vertices  of  Q  are 
parameterized  with  unknown  scalar  c.  So  the  basic  idea  is  that  the  search  space  of  the 
partitioning  hyperplanes  can  be  bounded  if  parameterized  vertices  and  extremal  rays  are 
used. 


Nonlinear  Programming  With  parameterized  vertices  and  extremal  rays,  the  input 
system  defined  in  Equation  (41)  will  become  a  nonlinear  system.  Hence  bounded  search 
and  nonlinear  programming  are  required  to  obtain  the  partitioning  hyperplanes  and  the 
schedule  for  each  disjoint  subdomain. 

We  now  present  the  formal  solution.  We  first  focus  on  the  case  when  there  is  one 
partitioning  hyperplane.  The  more  general  cases  of  multiple  hyperplanes  are  discussed 
later. 
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5.3.3  One  Partitioning  Hyperplane 


A  partitioning  hyperplane  can  be  expressed  as  BI  =  c,  where  B  is  an  unknown  row  vector  of 
length  d  and  c  is  an  unknown  scalar.  In  the  case  of  one  partitioning  hyperplane,  an  SSL-SD 
schedule  can  be  formulated  as  follows,  again  ignoring  the  statement  reordering  terms  for 
the  moment: 


»(/,$) 


(I  6  D)A  (BI  >  c) 
(I  6  D)  A  (BI  <  c) 


T\I  1 
T2I  j 


(46) 


According  to  subdomains  D\  =  D^BI  >  c )  and  D2  =  D^BI  <  c),  polyhedron  Q(Si,  S2, 1, 1) 
defined  in  Equation  (41)  is: 


Qn  =  Q(Su  s2, 1, 1)  =  V(SU  S2)l((BI  >  c)  A  (BJ  >  c)).  (47) 


Polyhedra  Decomposition  Under  Unknown  Constraints  Again,  we  need  to  rep¬ 
resent  Q\\  defined  in  Equation  (47)  by  its  vertices  and  extremal  rays.  As  described  in 
Section  5.1,  this  is  equivalent  to  finding  the  extremal  rays  of  polyhedron  Qh,  which  is  the 
corresponding  homogeneous  polyhedron  of  Q\\: 


Qk  =  Pkim 


)  >  0)  A  (B2 


K 

h 


)  >  0)), 


(48) 


K 

h 


>  0 


where  Ph  is  the  corresponding  homogeneous  polyhedron  of  V(Si,S2),  and  Bi 

>  0  are  the  homogeneous  representations  of  BI  >  c  and  BJ  >  c  respectively, 


and  B2 
i.e. 


K 

h 


B\  =  [B,0,  -c],  and 
B2  =  [6,  B,  -c]. 


(49) 

(50) 

(51) 


Note  that  since  row  vector  B  and  scalar  c  given  in  Equation  (46)  are  unknown,  row  vectors 
B\  and  B2  are  also  unknown.  Recall  that  Algorithm  ER  of  Section  5.1  iterates  over  the 
set  of  input  constraints.  Therefore,  in  obtaining  the  extremal  rays  of  Qk,  we  can  first 
obtain  the  set  of  extremal  rays  of  Pk,  denoted  by  TZ.  Two  more  iterations  are  required  to 


obtain  the  extremal  rays  of  Qh  from  72,  one  for  each  unknown  constraint  B\ 


I 

h 


>  0  and 


B2 


J 

h 


>  0. 
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Bounded  Search  Let  7 2  be  the  set  of  extremal  rays  of  J\,  72 1  be  the  set  of  extremal  rays 
>  0),  and  722  be  the  set  of  extremal  rays  of  Qh  defined  in  Equation  (48). 
We  have  the  following  lemma: 

Lemma  1  The  number  of  possible  sets  72  j  and  7Z2  derived  from  all  possible  row  vectors  Bx 
and  B2  is  bounded. 

Proof.  From  Algorithm  ER,  we  know  that  the  set  72i  is  ob  ained  according  to  the  pro¬ 
jections  of  the  rays  in  72  on  B\.  Although  Bx  is  unknown,  the  projection  of  a  vector  on 
any  vector  can  only  be  either  zero,  positive  or  negative.  Therefore,  based  on  tae  signs  of 
projections,  if  there  are  r  rays  in  72,  then  there  are  at  most  3r  possible  sets  72i-  And  since 
a  ray  in  72j  is  generated  from  either  a  ray  or  a  pair  of  rays  in  72,  each  set  72i  contains  a 
finite  number  of  rays.  Let  q  be  the  maximal  number  of  rays  in  a.1  sets  72j.  Clearly,  there 
are  at  most  3r+9  possible  sets  H2. 


of  PhI(Bx 


Nonlinear  Formulation  of  Extremal  Rays  Since  the  values  of  the  projections  are 
unknown,  some  rays  in  72j  need  to  be  specified  by  unknown  scalars.  For  example,  a  ray  in 
72x  generated  from  two  rays  R\  and  R2  in  72  is  a  linear  combination  of  R\  and  R2: 

otxR\  +  ct2R2 ,  (52) 

where  ax  and  a2  are  positive  variables  satisfying  the  nonlinear  condition  Bi(axRx  +a2R2)  — 
0  as  described  in  Algorithm  ER. 

Similarly,  a  ray  of  Qh  generated  from  two  linear  rays  (oi/?i  -f  a2R2)  and  (03.R3  +  o4/?4) 
in  V-i  is  a  nonlinear  combination  of  /?,,  1  <  *  <  4: 

05(01/?!  +  a2R2)  +  Qe(o3/?3  +  o4/?4),  (53) 

where  o5  and  06  are  positive  variables  satisfying  the  nonlinear  condition  /?2(Qs(ai/?i  + 

a2/?2)  +  C*6(a3/?3  +  ®  4 /?<<))  =  0. 


Nonlinear  Constraints  With  unknown  7j,  T2,  B 1,  B2  and  o,  we  may  have  the  following 
nonlinear  inequalities  that  constitutes  a  system  for  obtaining  an  SSL-SD  schedule  defined 


in  Equation  (46): 

(-Ti,  72](05(0i /?!-(- 02/?2)  +  Q6(03-/?3  +  04/?4))  >  0,  (54) 

Bx(axRi  -f  a2R2)  =  0,  (55) 

Ri(q3/?3  +  o4/?4)  =  0,  and  (56) 

B2{oih(Qi Rx  +  a2R2)  +  05(03/23  4-  o4/?4))  =  0.  (57) 


Equation  (54)  is  a  constraint  for  7j  and  T2,  which  specify  the  mapping  of  a  subdomain 
schedule,  and  Equations  (55),  (56)  and  (57)  are  constraints  for  B  and  c,  which  specify  the 
partitioning  hyperplane  of  a  subdomain  schedule.  Therefore,  the  order  of  the  nonlinear 
constraints  is  three. 
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Nonlinear  Systems  To  summarize,  given  one  partitioning  hyperplane,  there  is  a  bounded 
number  of  sets  of  parameterized  extremal  rays,  where  each  extremal  ray  can  be  specified 
by  a  nonlinear  expression  of  order  two  or  less.  Each  set  of  parameterized  extremal  rays 
corresponds  to  a  possible  partition  configuration.  And  we  need  to  solve  a  nonlinear  system 
of  order  three  for  each  possible  partition  configuration  to  obtain  Tj,  T-i,  B,  and  c,  which 
define  a  subdomain  schedule  given  in  Equation  (46).  Clearly,  many  partition  configurations 
would  be  invalid,  and  there  can  be  one  or  more  suitable  partition  configuration,  or  none. 

From  the  above  discussion,  we  have  the  following  theorem: 

Theorem  2  An  SSL-SD  schedule  with  one  partitioning  hyperplane  on  the  domain  can  be 
obtained  by  enumerative  secrch  through  a  bounded  search  space,  at  each  instance  of  which 
nonlinear  programming  of  order  three  is  required  to  obtain  the  partitioning  hyperplane  and 
the  schedule  for  each  subdomain. 


5.3.4  Multiple  Partitioning  Hyperplanes 

We  now  consider  the  cases  with  more  than  two  subdomains  separated  by  multiple  parti¬ 
tioning  hyperplanes.  From  the  above  case,  it  is  easy  to  see  that  if  there  are  p  partitioning 
hyperplanes,  then  polyhedron  Qh  defined  in  Equation  (48)  would  be  specified  by  2 p  un¬ 
known  constraints,  and  each  extremal  ray  of  Qh  would  be  of  order  2 p  or  less.  Therefore, 
the  order  of  the  systems  for  obtaining  an  SSL-SD  schedule  would  be  2p  +  1.  We  thus  have 
the  following  corollary: 

Corollary  3  An  SSL-SD  schedule  with  p  partitioning  hyperplanes  on  the  domain  can  be 
obtained  by  enumerative  search  through  a  bounded  search  space,  at  each  instance  of  which 
nonlinear  programming  of  order  2p  -f  1  is  required  to  compute  the  partitioning  hyperplanes 
and  the  schedule  for  each  subdomain. 


5.4  A  Heuristic  Algorithm  for  Generating  Subdomain  Schedules 

Inside-Out  Enumerative  Search  We  now  show  that,  if  the  subdomains  of  a  subdomain 
schedule  are  separated  by  a  set  of  given  partitioning  hyperplanes,  then  an  SSL-SD  schedule 
can  be  obtained  by  linear  programming.  We  can  use  inside-out  unbounded  enumerative 
search  [18,  19]  as  a  heuristic  to  generate  these  partitioning  hyperplanes.  Let  a  partitioning 
hyperplane  be  specified  by  BI  —  c,  where  B  is  a  row  vector,  and  c  is  a  scalar.  The  inside-out 
enumerative  search  starts  by  setting  the  bounds  of  the  absolute  values  of  the  elements  of 
row  vector  B  and  scalar  c  to  be  1,  then  2,  3,  etc.,  until  a  schedule  is  found.  Hopefully,  if 
there  exists  a  subdomain  schedule,  the  values  of  the  elements  of  B  and  c  would  be  small 
and  found  quickly.  And  this  appears  to  be  the  case  for  many  example  programs  including 
the  dynamic  programming  example  to  be  discussed  in  Section  8. 

Linear  Programming  Formulation  If  all  partitioning  hyperplanes  are  given,  then  each 
polyhedron  Q(S-t,  S2,  i,  j)  defined  in  Equation  (41)  can  be  decomposed  into  vertices  and 
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extremal  rays  by  Algorithm  ER  of  Section  5.1.  Therefore,  the  input  system  C  defined  in 
Equation  (41)  can  be  simplified  to  the  following  form: 

{[—Ti,Tj]V  >  0  |  for  all  vertex  V  of  Q(Si,  S2,  i,j),  where  Si  and  S2 
are  two  statements  in  S  such  that  Si  ^  So) 

(58) 

U  {[-T{,Tj]R  >  0  j  for  all  extremal  ray  R  of  Q(Si,  S2,  i,  j),  where  Si  and  S2 
are  two  statements  in  S  such  that  Si  =>  S2}. 

Consequently,  m  row  vectors  T,,  1  <  i  <  m,  of  length  d,  which  define  an  SSL-SD  schedule, 
can  be  obtained  by  linear  programming.  The  dimensionality  of  the  linear  programming 
system  is  m  *  d,  where  m  is  the  number  of  subdomains  and  d  is  the  dimensionality  of  the 
index  domain  of  the  loop  nest.  We  thus  have  the  following  theorem: 

Theorem  4  Given  the  partitioning  hyperplanes  of  the  index  domain  D  of  a  loop  nest, 
an  SSL-SD  schedule  can  be  obtained  by  linear  programming.  The  dimensionality  of  the 
linear  programming  system  is  m  *  d,  where  m  is  the  number  of  subdomains  and  d  is  the 
dimensionality  of  the  index  domain  of  the  loop  nest. 


5.5  An  Algorithm  for  Generating  Statement- Variant  Schedules 


Single  Statement  in  Each  Partition  From  the  subdomain  schedule,  we  know  that  it 
is  hard  to  find  the  partition  of  a  domain  and  the  schedule  for  each  subdomain  at  the  same 
time.  If  the  partition  of  the  domain  is  given,  then  the  problem  becomes  much  simpler.  This 
is  also  true  for  obtaining  statement-variant  schedules.  Let  s  be  the  number  of  statements 
in  a  loop  body.  If  s  is  not  too  large,  then  we  can  consider  the  extreme  case  of  a  statement- 
variant  schedule  that  each  partition  contains  only  one  statement.  In  this  case,  the  input 
system  C  containing  all  constraints  is: 


C  =  {[-T„T: } 


=  {{-T„T}] 


J 

J 

I 

J 


>  0  |  there  exist  statements  Si  and  Sj  in  S,  and  index  tuples 

/  and  J  in  D  such  that  S,@J  =>  S,@J}. 

(59) 

>  0  j  there  exist  statements  S,  and  Sj  in  5,  such  that 


(i,j)eV(Si,  5,)}, 


where  P(S,,  Sj)  is  defined  in  Equation  (39). 


Linear  Programming  Formulation  Since  the  vertices  and  extremal  rays  of  P(S,-,Sj) 
can  be  obtained  by  by  Algorithm  ER  of  Section  5.1,  the  input  system  C  defined  in  Equation 
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(59)  can  be  simplified  to: 

{[-Ti,Tj]V  >  0  |  for  all  vertex  V  of  P(Si,Sj),  where  S;  and  Sj 
are  two  statements  in  S  such  that  5,  Sj} 

U  {[— T,,  Tj]R  >  0  |  for  all  extremal  ray  R  of  P(S{,  Sj),  where  S,  and  Sj 
are  two  statements  in  S  such  that  Si  =>  Sj}. 


(60) 


Consequently,  s  row  vectors  T{,  1  <  i  <  s,  of  length  d ,  which  define  an  SSL-SV  schedule, 
can  be  obtained  by  linear  programming.  We  thus  have  the  following  theorem: 


Theorem  5  An  SSL-SV  schedule  of  a  loop  nest,  where  each  partition  of  statements  contains 
only  one  statement,  can  be  obtained  by  linear  programming.  The  dimensionality  of  the  linear 
programming  system  is  s*d,  where  s  is  the  number  of  statements  and  d  is  the  dimensionality 
of  the  index  domain  of  the  loop  nest. 


Algorithm  for  Generating  Nonuniform  Schedules  Since  a  nonuniform  schedule  is  a 
combination  of  a  subdomain  schedule  and  a  statement-variant  schedule,  it  can  be  obtained 
by  the  algorithms  described  in  Sections  5.3,  5.4  and  5.5. 


6  An  Iterative  Algorithm  for  Generating  Multiple- Sequential 
Level  Schedules 

We  now  present  the  algorithm  for  generating  multiple-sequential  level  schedules.  To  obtain 
an  n-sequential  level  schedule,  our  approach  is  to  first  find  n  SSL  temporal  morphisms 
g\,  . . .,  g ",  one  at  a  time,  followed  by  generating  the  statement  reordering  function  r. 
This  technique  applies  to  all  schedules:  uniform,  subdomain- variant,  statement-variant  or 
nonuniform. 

To  describe  an  iterative  algorithm,  we  need  to  define  a  sequence  of  minimal  target 
difference  vectors  similar  to  the  ones  defined  in  Equations  (34)  and  (40). 


A  Sequence  of  Target  Difference  Vectors  We  first  define  T,,  1  <  i  <  n,  to  be  the 
function  space  [D  -*  £'  x  ...x  £}],  where  each  E{,  1  <  j  <  n,  is  a  one- dimensional 
temporal  index  domain,  and  define  Z  to  be  the  set  of  rationals. 

For  uniform  schedules,  we  define  a  family  of  second-order  functions  pi(f,S\,S2),  1  < 
i  <  d,  to  be  a  sequence  of  minimal  target  difference  vectors  (in  the  sense  of  lexicographical 
ordering  on  elements  of  E\  x  . . .  x  E[)  ranging  over  the  images  of  the  set  of  dependence 
vectors  V(5j,52)  defined  in  Equation  (28)  under  function  /  €  T,-: 

„  :  r,xSxS-2,  where 
Hi(f,S i,S2)  =  min{f(K)\Ke  V(5,,52)}. 
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For  subdomain-variant,  statement-variant  and  nonuniform  schedules,  we  define  a  family 
of  second-order  functions  /*«(/,  Si,  S2),  1  <  *  <  d,  to  be  a  sequence  of  minimal  target 
difference  vectors  ranging  ever  the  vectors  f(  J)  -  /(/)  where  (/,  J )  6  V(Si,S2)  and  /  €  T,: 

Mi(f,  Si,  S2)  =  min{/(  J)  -  f(I)  |  (/,  J)  €  7>(5i,  S2)>.  (62) 


Iterative  Algorithm  Similar  to  the  algorithm  for  obtaining  SSL  schedules  with  state¬ 
ment  reordering  functions  as  discussed  in  Section  5.2,  the  algorithm  for  generating  multiple- 
sequential  level  schedules  starts  with  criterion  n(g\,  Si,  S2)  >  0  for  all  pairs  of  statements 
S\  and  S2  such  that  Si  =>  52  to  generate  an  SSL  temporal  morphism  g}.  We  then  use 
the  constraint  /i(gf,  Si,  S2)  >  0  for  those  dependences  Si  =>  52  such  that  Si,  S2)  =  0 
to  find  g\.  The  same  process  is  iterated  until  the  n-th  iteration,  1  <  n  <  d,  when  the 
remaining  dependences  Si  =>  S2  where  •  • . ,  <?")>  Si,  S2)  =  6  are  not  cyclic.  In  this 

case,  the  set  of  statements  S  has  the  partial  ordering  defined  in  Section  5.2.  We  then  use 
topological  sort  to  find  the  linear  ordering  of  the  statements. 

We  summarize  the  above  discussion  as  the  following  iterative  algorithm  for  generating 
multiple-sequential  level  schedules:  The  inputs  of  Algorithm  MSL  consist  of  the  set  of  all 
dependences  in  a  loop  nest,  denoted  by  £,  and  the  choice  of  one  of  the  algorithms  presented 
in  Section  5.1,  5.3,  5.4  or  5.5,  denoted  by  A. 

Algorithm  MSL  {C  :  a  set  of  dependences ,  A  :  an  algorithm) 

1.  i  <—  0  (t  will  be  ranging  over  the  loop  levels); 

2.  While  i  <  d  and  £  contains  cyclic  dependences,  do 

(a)  i  i  +  1; 

(b)  Find  an  SSL  temporal  morphism  g[  by  using  algorithm  A  such  that  n(g[ ,  5j ,  S2)  > 
0  holds  for  all  Si  =>  52  €  £', 

(c)  If  such  g\  does  not  exist,  then  the  algorithm  terminates  without  returning  a 
schedule; 

(d)  Else,  £  <—  {(5i  =$»  52)  |  (5j  =s>  52)  €  £  such  that  n(g\,  Si,  S2)  =  0); 

3.  (Now  £  contains  no  cyclic  dependences.)  Find  a  statement  reordering  function  r  by 
topological  sort,  such  that  r(5i)  <  r(52)  for  all  (5i  =>  52)  G  £. 

Note  that  the  sequential  loops  in  the  transformed  loop  nest  generated  by  Algorithm 
MSL  are  perfectly  nested. 


7  A  Recursive  Algorithm  for  Generating  Mixed  Schedules 

We  now  present  the  algorithm  for  generating  mixed  schedules.  We  first  discuss  the  basic 
idea  of  this  algorithm. 
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Basic  Idea  Algorithm  MSL  presented  in  the  previous  section  for  generating  multiple- 
sequential  level  schedules  treats  all  the  statements  in  the  same  way  throughout  the  itera¬ 
tions  even  though  more  and  more  of  the  dependences  become  uninformative  in  obtaining 
the  sequence  g\,  . . .,  <7".  Clearly,  when  some  instances  of  dependence  relations  are  not 
considered,  a  new  set  of  equivalence  classes  under  relation  over  statements  emerges. 
As  discussed  in  Section  2,  these  new  equivalence  classes  can  be  scheduled  separately  for  the 
subsequent  iterations.  To  do  this,  a  tree  of  temporal  morphisms,  instead  of  just  a  sequence 
•  •  •?  9\i  be  generated. 

Labeling  a  Tree  We  use  a  prefix  notation  to  label  each  node  of  the  trees  of  temporal 
morphisms,  equivalence  classes  and  sets  of  dependence  relations  to  be  defined  later.  Let  the 
root  of  the  tree  be  labeled  1.  For  a  node  which  is  the  t-th  child  of  its  parent  node,  we  say 
the  order  of  this  node  is  i.  A  node  is  labeled  /  o  i  if  /  is  the  label  of  its  parent  node  and  it 
has  order  i.  A  temporal  morphism  g\  with  label  /  is  denoted  by  g\(l),  an  equivalence  class 
0  with  label  l  is  denoted  by  B(l),  and  a  set  of  dependences  £  with  label  l  is  denoted  by 

AO- 


Recursive  Algorithm  The  initial  inputs  of  the  recursive  algorithm  for  generating  mixed 
schedules  consist  of  the  equivalence  class  .0(1),  which  contains  all  statements  in  S,  and  the 
set  £(1),  which  contains  all  dependences  in  the  source  loop  nest.  With  inputs  0(Z)  and 
£(/),  the  algorithm  obtains  an  SSL  temporal  morphism  gi(l)  such  that  n(gt(l),  Si,  S2)  >  0 
holds  for  all  (Si  =>  S2)  G  £(/).  Under  gi(l),  the  set  of  dependences 

W)  =  {(Si  =>  S2)  |  (Si  =*  S2)  6  C(l),n(gi(l),Si,S2)  >  0)  (63) 

becomes  uninformative  and  should  be  removed  from  C(l).  A  new  set  of  equivalence  classes, 
denoted  by  new(Z),  emerges: 

new(/)  =  {0(/  o  i )  |  B(l  o  i)  is  the  i-th  equivalence  class  from  £(/)  -  £/(/)}.  (64) 

For  each  new  equivalence  class  B(l  o  i),  1  <  i  <  |new(/)|,  the  associated  set  of  dependences 
£(/  o  i)  is 

£(/  o  i)  =  {(Sj  =>  S2)  |  Sx  G  B(l  o  i ),  52  G  B(l  o  i ),  (5,  =*  S2)  G  (£(/)  -  U(l))}.  (65) 

If  B(l  o  t),  1  <  *  <  |new(Z)|,  is  a  dependent  block,  then  the  same  algorithm  is  applied 
recursively  to  the  new  inputs  B(l  o  i )  and  C(l  0  i).  The  recursive  algorithm  stops  when  all 
dependent  blocks  have  been  broken  up  and  all  remaining  blocks  are  independent. 

We  summarize  the  above  discussion  as  the  following  recursive  algorithm  for  generating 
mixed  schedules:  The  initial  inputs  of  Algorithm  MIX  consist  of  the  equivalence  class  0(1), 
the  set  of  dependences  £(1),  and  the  choice  of  one  of  the  algorithms  presented  in  Section  5.1, 
5.3,  5.4  or  5.5,  denoted  by  A. 

Algorithm  MIX  (0(Z)  :  an  equivalence  class,  £(/) :  a  set  of  dependences,^  :  an  algorithm) 

1.  Find  an  SSL  temporal  morphism  gi(l)  by  using  algorithm  A,  such  that 
n(gi(l),  Si,  S2)  >  0  holds  for  all  (Si  =»  S2)  G  £(/); 
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2.  If  such  9i(l)  does  not  exist,  then  the  algorithm  terminates  without  returning  a  sched¬ 
ule. 

3.  obtain  the  set  new(/)  (remove  uninformative  dependences  and  generate  new  equiva¬ 
lence  classes); 

4.  Generate  new  sets  of  dependences  C{1  o  z),  1  <  i  <  |new(/)|; 

5.  For  i  G  [1,  |new(/)|],  if  B(l  o  t)  is  a  dependent  block,  then  MIX( B(l  o  z),  C(l  o  z),  >1). 


Statement  Reordering  We  now  discuss  how  to  obtain  the  statement  reordering  function 
r  for  mixed  schedules.  Since  the  algorithm  for  generating  mixed  schedules  is  recursive, 
statement  ordering  is  also  obtained  recursively.  Suppose  we  want  to  find  the  ordering  among 
the  statements  in  an  equivalence  class  £(/).  Let  B{1  o  i),  1  <  i  <  n,  be  the  equivalence 
classes  in  new(Z)  generated  from  B(l).  Wf  can  first  determine  the  ordering  among  these  n 
equivalence  classes,  which  induces  the  ordering  among  statements  in  different  B{1  o  z),  but 
not  the  ordering  among  statements  within  the  same  B(l  o  i ).  Since  each  B(l  o  *),  1  <  i  <  n, 
will  be  scheduled  separately,  the  ordering  among  statements  in  different  B(l  o  i)  will  never 
be  changed  subsequently.  The  same  process  is  then  applied  recursively  to  each  B(l  o  z), 
1  <  z  <  n,  to  determine  the  ordering  among  statements  within  B(l  o  z). 

The  ordering  among  these  n  equivalence  classes  is  obtained  as  follows.  Similar  to  Sec¬ 
tion  5.2,  we  define  a  partial  ordering  “<”  over  the  set  of  equivalence  classes  new(Z).  We  say 
B(l  ol)  <  B(l  o2),i.e.  equivalence  class  R(/ol)must  be  in  front  of  equivalence  class  B(lo2) 
in  the  transformed  loop  body,  if  there  exist  statements  5i  G  B(l  o  1)  and  S2  G  B(l  o  2)  such 
that  (Si  =>  S2)  G  £(/)  and  n(gi(l).  Si,  S2)  =  0. 

Due  to  the  way  these  n  equivalence  classes  are  generated  from  B(l),  new (/)  always  has 
this  partial  ordering.  Therefore,  topological  sort  can  produce  the  linear  ordering  of  these  n 
equivalence  classes. 


Imperfect  Loop  Nests  For  a  given  statement  S,  it  may  belong  to  a  nested  level  of 
dependent  blocks  f?(l),  . . .,  B(l),  where  B(  1)  D  ...  D  B(l).  Clearly,  the  schedule  of 
statement  S  is  |/|-sequential  level,  where  |/|  is  the  length  of  the  label  l : 

zr(/,  S)  =  (5l(l )(/), . . . , *(0(J),  r(S)).  (66) 

Since  different  statements  can  belong  to  different  sets  of  dependent  blocks,  the  sequential 
level  of  7r(J,  5)  can  be  different  for  different  statements.  And  the  sequential  loops  in  the 
transformed  parallel  loop  nest  can  be  of  any  form,  i.e.  perfectly  and  imperfectly  nested 
loops. 

When  different  equivalence  classes  are  scheduled  differently,  each  SSL  temporal  mor¬ 
phism  will  be  a  statement-variant  schedule  with  the  partition  of  statements  being  these 
new  equivalence  classes.  Recall  that  in  Section  5.5,  a  statement-variant  schedule  is  ob¬ 
tained  by  allowing  each  statement  to  be  scheduled  differently,  but  not  independently.  The 
advantage  to  schedule  independently  is  that  the  likelihood  of  obtaining  a  suitable  schedule 
with  more  parallelism  increases  if  each  dependent  block  is  considered  separately. 
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Example  We  use  the  following  example  to  explain  Algorithm  MIX  further.  In  this  ex¬ 
ample,  each  SSL  temporal  morphism  used  in  Algorithm  MIX  is  a  uniform  one. 

Loop  Nest  15 

DO  ( i  =  2,  n)  { 

DO  (j  =  2,  n)  { 

DO  ( k  =  2,  n)  { 

51  :  A(i,j,k)  =  A(i  —  1  ,k,j)+  B(i  -  1  ,j,k) 

52  :  B(i,j,k )  =  B(i,j  -  1  ,j)  +  C(i  -  1  ,j,k) 

53  :  C(i,j,  k)  =  k-  1)  +  A(i,j,  k)  }  }  } 

The  initial  B(  1)  is  the  set  {Si,S2,S3}  and  £(1)  is  the  set  {(S*  =»  Sy)  \  (x,y)  6 
{(1,1),  (2, 2),  (3, 3),  (2, 1),  (3, 2),  (1,3)}.  It  is  easy  to  check  that  if  ffi(l)((t,j,  k),  S)  =  i  for 
all  S  6  5(1),  then  /i(</i(l),  Sx,  Sv)  =  1  for  ( x,y )  in  the  set  U  =  {(1, 1),  (2, 1),(3,2)}.  And 
/z(<7i(l),  Sx,  Sy)  =  0  for  ( x,y )  in  ( x,y )  E  {(2, 2),  (3,3),  (1, 3)}.  Consequently,  dependences 
Sx  =>  Sy,  ( x,y )  €  U,  can  be  removed  and  5(1)  is  broken  up  into  three  equivalence  classes 
5(1  o  i),  1  <  i  <  3;  each  contains  a  single  statement  S,. 

Since  5(1  o  1)  is  an  independent  block,  the  schedule  for  Si  will  be  single- sequential 
level.  Furthermore,  S2  and  S3  are  in  different  equivalence  classes  and  they  can  be  scheduled 
independently.  Let  31  (1  o  2)((i,  j,  k),  S )  =  j  and  <?i(l  o3)((i,  j,  k),S)  =  k.  It  is  easy  to  check 
that  n(gi(l  o  2),  S2,  S2)  =  1  and  /i(^i(l  o  3),  S3,  S3)  =  1. 

To  summarize,  the  mixed  schedule  is  single-sequential  level  for  Si  (two-dimensional 
parallelism),  and  is  two-sequential  level  for  S2  and  S3  (one-dimensional  parallelism)  as 
shown  below: 


7r((i,;,fc),S) 


S  =  Si  -*  ( i,  0) 

'  s  =  s2  -f  (*,j,i) 

S  =  S3  ->  (i,k,  2) 

/ 


(67) 


Because  ^(51(1),  Si,  S3)  =  0  and  statements  Si  and  S3  are  in  different  equivalence  classes. 
Si  must  be  in  front  of  S3  in  the  transformed  loop  body.  The  resulting  parallelized  program 
with  imperfectly  nested  sequential  loops  is: 
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Loop  Nest  16 

DO  (i  =  2 ,  n)  { 

DOALL  ((i  =  2,n),(fc  =  2,n)){ 

Si  :  A(i,j,k )  =  A(i  -  B(i  -  1  ,j,k)  } 

DO  (j  =  2,  n)  { 

DOALL  (Jb  =  2,n){ 

52  :  B(i,j,k)  =  B(i,j  —  1  ,j)  +  C(i-  l,j,k )  }  } 

DO  (A:  =  2,  n)  { 

DOALL  (j  =  2,  n)  { 

53  :  C(i,j,k)  =  k  —  1)  +  A(i  -  1,  j,  k)  }  }  } 

8  An  Application:  Dynamic  Programming 

To  illustrate  the  usefulness  of  the  new  scheduling  algorithms,  we  take  dynamic  programming 
as  an  example,  which  has  sequential  complexity  0(n3)  for  a  problem  of  size  n.  The  source 
code  is  given  in  Program  17. 

Loop  Nest  17 

DO  (i  =  1,  n  —  1)  { 

C(i ,  i  +  1)  =  initial  —  values  } 

DO  (t  =  1,  n  -  2)  { 

DO  (jf  =  i  +  2,n)  { 

C(i,j)  =  m\ni<k<j(h(C(i,k),C(k,j)))}  } 

This  source  program  is  first  transformed  to  the  following  form  in  a  systematic  manner 
by  applying  fan-in  and  fan-out  reductions  [10]  to  reduce  potential  concurrent  accesses  of 
variables.  The  result  is  Program  18: 
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Loop  Nest  18 


DO  (i  =  1,  n  -  1)  { 

DO  (jfe  =  l,n){ 

C(i,  i  +  1,  k)  —  initial  -  values  }  } 

DO  (i  =  n~  1,1, -1){ 

DO  ( j  =  i  +  l,n)  { 
m  =  (i  +  j  +  l)/2 
DO  (fc  =  «,j){ 

Sal  ■  tF(*  <  3)  A{i,j,k)  =  A(i,j  —  l,fc) 

Sbi  ■  IF(*+  l  =  k)  B(i,j,k)  =  C(i  +  1  ,j,j) 

Sb2  ■  IF(i  +  1  <  k)  B(i,j,k)  =  B(i  +  1  ,j,k) 

Sc i  :  IF(m  =  k)  C(i,j,k)  =  h!(A(i,j,k),B(i,j,k), 

A(i,j,i  +  j  -  k),B(i,j,i  +  j  -  k )) 

Sc2  :  IF(m  <  k  <  j)C(i,j,k)  =  h2(C(i,j,k-  l),A(i,j,k), 
B(i,j,k),A(i,j,i+  j  -  k),  B(i,j,  i  +  j  -  k)) 

Sc 3  :  IF(fc  =  j)C(i,j,k)  -  C(i,j,k-  1) 

Sa2  :  IF(fc  =  j)  A(i,j,k)  s=  C{i,j,k)  }}} 

Schedules  We  wrote  three  *lisp  programs  on  the  Connection  Machine  CM/2,  each  with 
the  control  structure  generated  by  a  two-sequential  level  uniform  schedule,  a  mixed  statement- 
variant  schedule  and  a  single-sequential  level  subdomain  schedule  respectively.  We  also  have 
a  sequential  Common-Lisp  program  on  the  Symbolics  to  compute  the  same  problem.  The 
three  schedules  are  given  below.  For  simplicity,  we  do  not  give  the  constant  terms  r(S)  of 
function  n. 


two-sequential  level  uniform  schedule: 

*r (S,(i,j,k))=  (j  -  i,k  —  i) 
nr  :ed  statement-variant  schedule: 

'  S  =  Sc2^(j-i,k-  i) 
else  — *■  j  —  i 

single-sequential  level  subdomain  schedule: 

2k  <  0  -»  -I 
-  2k  >  0  —  -t 


*(S,(i,j,k))  = 


gic-octjucimai  icvci  o uuuumaui  duicuuic. 

/  c  (■  ■  . »  f  i  +  j  -  2k  <  0  ->  -2i  +  j  +  k  1 
^  i  +  j  -  2fc  >  0  — >  —i  +  2j  -  k  J 


Experimental  Result  The  experiment  is  conducted  as  follows:  we  run  the  sequential 
code  on  the  Symbolics  and  parallel  codes  on  an  8K-processor  Connection  Machine  with 
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Figure  3:  Running  time  in  seconds. 
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Figure  4:  Running  time  vs.  problem  size. 


Symbolics  as  its  host.  The  results  described  in  Figure  3  and  Figure  4  show  that  the  ver¬ 
sion  using  a  single-sequential  level  subdomain  schedule  is  three  orders  of  magnitude  faster 
than  the  sequential  code,  and  is  two  orders  of  magnitude  faster  than  the  versions  using 
a  two-sequential  level  uniform  schedule  and  mixed  statement-variant  schedule.  And  the 
program  using  a  mixed  statement- variant  schedule  is  about  three  to  four  times  faster  than 
the  program  using  a  two-sequential  level  uniform  schedule. 
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